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A Method for Compressing Parameters in Bayesian Models with 
Application to Logistic Sequence Prediction Models* 

Longhai Li^ and Radford M. Neal* 

Abstract. Bayesian classification and regression with high order interactions is largely infeasible 
because Markov chain Monte Carlo (MCMC) would need to be applied with a great many 
parameters, whose number increases rapidly with the order. In this paper we show how to 
make it feasible by effectively reducing the number of parameters, exploiting the fact that many 
q ■ interactions have the same values for all training cases. Our method uses a single "compressed" 
CN '. parameter to represent the sum of all parameters associated with a set of patterns that have 
the same value for all training cases. Using symmetric stable distributions as the priors of the 
original parameters, we can easily find the priors of these compressed parameters. We therefore 
need to deal only with a much smaller number of compressed parameters when training the model 
with MCMC. The number of compressed parameters may have converged before considering the 
highest possible order. After training the model, we can split these compressed parameters 
into the original ones as needed to make predictions for test cases. We show in detail how to 
compress parameters for logistic sequence prediction models. Experiments on both simulated 
and real data demonstrate that a huge number of parameters can indeed be reduced by our 
C3 ■ compression method. 

C/3 ! 

1 Introduction 

m : 
oo . 

q\ . In many classification and regression problems, the response variable y depends on high-order 
interactions of "features" (also called "covariates" , "inputs" , "predictor variables" , or "explana- 
tory variables"). Some complex human diseases are found to be related to high-order interactions 
■ of susceptibility genes and environmental exposures (Ritchie et. al. 2001). The prediction of the 
next character in English text is improved by using a large number of preceding characters (Bell, 
>■ . Cleary and Witten 1990). Many biological sequences have long-memory properties. 

When the features are discrete, we can employ high-order interactions in classification and 
c3 • regression models by introducing, as additional predictor variables, the indicators for each pos- 
sible interaction pattern, equal to 1 if the pattern occurs for a subject and otherwise. In this 
paper we will use "features" for the original discrete measurements and "predictor variables" for 
these derived variables, to distinguish them. The number of such predictor variables increases 
exponentially with the order of interactions. The total number of order-k interaction patterns 
with k binary (0/1) features is 2 fc , accordingly we will have 2 k predictor variables. A model with 
interactions of even a moderate order is prohibitive in real applications, primarily for computa- 
tional reasons. People are often forced to use a model with very small order, say only 1 or 2, 
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which, however, may omit useful high-order predictor variables. 

Besides the computational considerations, classification and regression with a great many 
predictor variables may "overfit" the data. Unless the number of training cases is much larger 
than the number of predictor variables the model may fit the noise instead of the signal in the 
data, with the result that predictions for new test cases are poor. This problem can be solved 
by using Bayesian modeling with appropriate prior distributions. In a Bayesian model, we use 
a probability distribution over parameters to express our prior belief about which configurations 
of parameters may be appropriate. One such prior belief is that a parsimonious model can 
approximate the reality well. In particular, we may believe that most high-order interactions are 
largely irrelevant to predicting the response. We express such a prior by assigning each regression 
coefficient a distribution with mode 0, such as a Gaussian or Cauchy distribution centered at 
0. Due to its heavy tail, a Cauchy distribution may be more appropriate than a Gaussian 
distribution to express the prior belief that almost all coefficients of high order interactions are 
close to 0, with a very small number of exceptions. Additionally, the priors we use for the widths 
of Gaussian or Cauchy distributions for higher order interaction should favor small values. The 
resulting joint prior for all coefficients favors a model with most coefficients close to 0, that is, 
a model emphasizing low order interactions. By incorporating such prior information into our 
inference, we will not overfit the data with an unnecessarily complex model. 

However, the computational difficulty with a huge number of parameters is even more pro- 
nounced for a Bayesian approach than other approaches, if we have to use Markov chain Monte 
Carlo methods to sample from the posterior distribution, which is computationally burdensome 
even for a moderate number of parameters. With more parameters, a Markov chain sampler will 
take longer for each iteration and require more memory, and may need more iterations to con- 
verge or get trapped more easily in local modes. Applying Markov chain Monte Carlo methods 
to classification and regression with high-order interactions therefore seems infeasible. 

In this paper, we show how these problems can be solved by effectively reducing the number of 
parameters in a Bayesian model with high-order interactions, using the fact that in a model that 
uses all interaction patterns, from a low order to a high order, many predictor variables have 
the same values for all training cases. For example, if an interaction pattern occurs in only one 
training case, all the interaction patterns of higher order contained in it will also occur in only 
that case and have the same values for all training cases — 1 for that training case and for all 
others. Consequently, only the sum of the coefficients associated with these predictor variables 
matters in the likelihood function. We can therefore use a single "compressed" parameter to 
represent the sum of the regression coefficients for a group of predictor variables that have the 
same values in training cases. For models with very high order of interactions, the number of 
such compressed parameters will be much smaller than the number of original parameters. If 
the priors for the original parameters are symmetric stable distributions, such as Gaussian or 
Cauchy, we can easily find the prior distributions of these compressed parameters, as they are also 
symmetric stable distributions of the same type. In training the model with Markov chain Monte 
Carlo methods we need to deal only with these compressed parameters. After training the model, 
the compressed parameters can be split into the original ones as needed to make predictions for 
test cases. Using our method for compressing parameters, one can handle Bayesian regression 
and classification problems with very high order of interactions in a reasonable amount of time. 

This paper will be organized as follows. In Section [2] we describe in general terms the method 
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of compressing parameters, and how to split them to make predictions for test cases. We then 
apply the method to logistic sequence models in Section O There, we will describe the specific 
schemes for compressing parameters for the sequence prediction models, and use simulated data 
and real data to demonstrate our method. We draw conclusions and discuss future work in 
Section HI 

The software package (using R as interface but with most functions written in C) for the 
method described in this paper is available from http://math.usask.ca/~longhai. 



2 Our Method for Compressing Parameters 
2.1 Compressing Parameters 

Our method for compressing parameters is applicable when we can divide the regression coef- 
ficients used in the likelihood function into a number of groups such that the likelihood is a 
function only of the sums over these groups. The groups will depend on the particular training 
data set. An example of such a group is the regression coefficients for a set of predictor variables 
that have the same values for all training cases. It may not be easy to find an efficient scheme 
for grouping the parameters of a specific model. We will describe how to group the parameters 
for sequence prediction models in Section [3] Suppose the number of such groups is G. The 
parameters in group g are denoted by (3 g i, . . . , /3 g , ng , and the sum of them is denoted by s g : 

n g 

S 9 = ^2^ for 9 = 1, • • • ,G (1) 

k=l 

We assume that the likelihood function can be written as: 

L • • • , /3l,ni> ••• > PGXi ■ ■ ■ ) ^G,n G ) 

(ni n G \ 

J^Pik, ...,X> G * =L(s u ... ,s G ) (2) 
k=l k=l / 

Note that the above /3's are only the regression coefficients for the interaction patterns occurring 
in training cases. The predictive distribution for a test case may use extra regression coefficients, 
whose distributions depend only on the priors given relevant hyperparameters. 

We need to define priors for the f3 g k in a way that lets us easily find the priors of the s g . For this 
purpose, we assign each f5 g k a symmetric stable distribution centered at with width parameter 
a g k- Symmetric stable distributions (Feller 1966) have the following additive property: If random 
variables X±, . . . ,X n are independent and have symmetric stable distributions of index a, with 
location parameters and width parameters <Ti, . . . , a n , then the sum of these random variables, 
Y^i=\ Xii a l so has a symmetric stable distribution of index a, with location parameter and width 
parameter (J^™ =1 a") 1 / . Symmetric stable distributions exist and are unique for a G (0, 2]. The 
symmetric stable distributions with a = 1 are Cauchy distributions. The density function of 
a Cauchy distribution with location parameter and width parameter a is [tt<j(1 + x 2 /<t 2 )] -1 . 
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/»" J5 2 /5 1 

Figure 1: A picture depicting the sampling procedure after compressing parameters. 

The symmetric stable distributions with a = 2 are Gaussian distributions, for which the width 
parameter is the standard deviation. Since the symmetric stable distributions with a other than 
1 or 2 do not have closed form density functions, we will use only Gaussian or Cauchy priors. 
That is, each parameter f3 gk has a Gaussian or Cauchy distribution with location parameter 
and width parameter a gk : 



Pgk ~ N(0, a gk ) or (3 gk ~ Cauchy(0, a gk ) (3) 

Some a g k may be common for different j3 g k, but for the moment we denote them individually. 
We might also treat the a g kS as unknown hyperparameters, but again we assume them fixed for 
the moment. 

If the prior distributions for the f3 g kS are as in ([3]), the prior distribution of s g can be found 
using the property of symmetric stable distributions: 



S 9 ~ N ( ' OT S9 ~ CaUChy ( 0) j^^^j ^ 

Let us denote the density of s g in (TjJ by (either a Gaussian or Cauchy), and denote 
si, . . . , sq collectively by s. The posterior distribution can be written as follows: 

P(s I V) = ^ L(s u . . . , s G ) P^si) ■ ■ ■ P°(s G ) (5) 

where T> is the training data, and c(T>) is the marginal probability or density function of T>. 

Since the likelihood function L(si, . . . , sq) typically depends on s%, . . . , s G in a complicated 
way, we may have to use some Markov chain sampling method to sample for s from distribu- 
tion (GO). 



2.2 Splitting Compressed Parameters 

After we have obtained samples of s g , probably using some Markov chain sampling method, we 
may need to split them into their original components f3 g i, . . . , (3 gyrig to make predictions for test 
cases. This "splitting" distribution depends only on the prior distributions, and is independent of 
the training data T>. In other words, the splitting distribution is just the conditional distribution 
of /3 gl , . . . , (3 gng given J2k=i @gk = s g , whose density function is: 
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P(/3gl, • • • , Pg,n a -1 I s g) ~ 

where P g k is the density function of the prior for (3 g k- Note that /3 0jng is omitted since it is equal 

As will be discussed in the Section 12.44 sampling from can be done efficiently by a direct 
sampling method, which does not involve costly evaluations of the likelihood function. We 
need to use Markov chain sampling methods and evaluate the likelihood function only when 
sampling for s. Figure [T] shows the sampling procedure after compressing parameters, where 
/3 is a collective representation of (3 g k, for g = 1, . . . , G, k = 1, . . . , n g — 1. When we consider 
high-order interactions, the number of groups, G, will be much smaller than the number of /3 g kS. 
This procedure is therefore much more efficient than applying Markov chain sampling methods 
to all the original (5 g k parameters. 

Furthermore, when making predictions for a particular test case, we actually do not need to 
sample from the distribution ([6]), of dimension n g — 1, but only from a derived 1-dimensional 
distribution, which saves a huge amount of space. 

Before discussing how to sample from we first phrase this compressing-splitting procedure 
more formally in the next section to show its correctness. 



n p M 



>gk) 



k=l 



P. 



9,n g 



(6) 



k=l 



2.3 Correctness of the Compressing-Splitting Procedure 

The above procedure of compressing and splitting parameters can be seen in terms of a transfor- 
mation of the original parameters /3 g k to a new set of parameters containing s 9 's, as defined in ([I]), 
in light of the training data. The posterior distribution §5§ of s and the splitting distribution fl§]) 
can be derived from the joint posterior distribution of the new parameters. 

The invertible mappings from the original parameters to the new parameters are shown as 
follows, for g = 1, . . . , G, 

n g 

(0gl, • • • , ftg,n g -U Pg,n g ) =^ (ftgli • • ■ > ftg,n a -li Pgk) = (ftgl, • ■ ■ i Pg,n a -1, S g ) (7) 

k=l 

In words, the first n g — 1 original parameters f3 g kS are mapped to themselves (we might use another 
set of symbols, for example b g k, to denote the new parameters, but here we still use the old ones 
for simplicity of presentation while making no confusion), and the sum of all /3 g ,kS, is mapped 
to s g . Let us denote the new parameters j3 g k, for g — 1, . . . , G, k — 1, . . . , n g — 1, collectively by 
(3, and denote si, . . . , s g by s. (Note that j3 does not include (3 g , ng , for g = 1, . . . ,G. Once we 
have obtained the samples of s and (3 we can use P g ,n g — s g — X^li* Pgk t° obtain the samples 
of /W) 

The posterior distribution of the original parameters, (3 g k, is: 



n\ n G \ G ri g 

p(Pn, ... ,p G> nc \v)=^L[j2^--->Y,fck nn p Mk) 

^ ' \k=\ k=l / 0=1 fc=l 
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By applying the standard formula for the density function of transformed random variables, we 
can obtain from (jSJ) the posterior distribution of the s and (3: 



G 



P(s,(3\V) = -^-L( Sl ,...,s G )f[ 

^ ' g=l 



n„ — l 



n p M 



gk) 



k=l 



n„— 1 



P 9,ng I S 9 



J2(3 gk ) l det ( J )l ( 9 ) 



A = l 



where the | det(J)| is absolute value of the determinant of the Jacobian matrix, J, of the map- 
ping (J7J), which can be shown to be 1. 

Using the additive property of symmetric stable distributions, which is stated in section [2TT| we 
can analytically integrate out (3 in P(s, (3 \ V), resulting in the marginal distribution P(s | V>\. 



P(s | V) = J P(s,/3 | V)d(3 



ciV) 

G 

n 

9=1 
1 

W) 



L (s 1 , 



sg) 



n„-l 



n 



k=l 



n a -l 



k=l 



L (si, s G ) Pf(si 



PgM 



The conditional distribution of f3 given T> and s can then be obtained as follows: 



(10) 



P 9,n 3 \Sg ~ Yl P9k) dpgl • ■ ■ df3 g , ng - X (11) 



(12) 



P({3 | s,V) = P(s,(3 | V)/P{s | V) 



G 

n 

9=1 



'ra — 1 



[ Pgk(Pgk) 



k=l 



n a —l 



P 9 ,n a [sg-J2Pgk) /P°(s g 



k = l 



(13) 
(14) 



From the above expression, it is clear that P(f3 \ s,T>) is unrelated to T>, i.e., P({3 \ s,T>) = 
P(f3 | s), and is independent for different groups. Equation §6§ gives this distribution only for 
one group g. 



2.4 Sampling from the Splitting Distribution 

In this section, we discuss how to sample from the splitting distribution ([SD to make predictions 
for test cases after we have obtained samples of S\, . . . , sg- 

If we sampled for all the /3 g k% storing them would require a huge amount of space when the 
number of parameters in each group is huge. We therefore sample for (3 conditional on si, . . . , sg 
only temporarily, for a particular test case. As will be seen in Section [3], the predictive function 
needed to make prediction for a particular test case, for example the probability that a test case 
is in a certain class, depends only on the sums of subsets of flgk's in groups. After re-indexing the 
/3 g kS in each group such that the @ g i, . . . , f3 g ,t g are those needed by the test case, the variables 
needed for making a prediction for the test case are: 
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b 9 



Y,(3 9 k, ior g = l,...,G, (15) 



fc=i 



Note that when t g = 0, s g = 0, and when t g = n g , s g = s g . The predictive function may also 
use some sums of extra regression coefficients associated with the interaction patterns that occur 
in this test case but not in training cases. Suppose the extra regression coefficients need to be 
divided into Z groups, as required by the form of the predictive function, which we denote by 
. . . , P* tTl *, ■ ■ ■ , Pz,ii ■ ■ ■ 5 P*z,n* z - The variables needed for making prediction for the test cases 

are: 

n t 

< = XX> for*=l,.",2 (16) 
k=i 

In terms of the above variables, the function needed to make a prediction for a test case can 
be written as 

to 

j^Pik, ... ,J2^ k , x>v ... =< s ^ ■ ■■ ••• >4) (17) 

fc=i /c=i fc=i fe=i 

Let us write s\, . . . , Sq collectively as s*, and write s*, . . . , s* z as s*. The integral required to 
make a prediction for this test case is 

r G 

/ a(s*,«*) P(s*) P(s | P) [ P(s* | s 9 ) ds ds*ds*. (18) 

^ 9=1 

The integral over s* is done by MCMC. We also need to sample for s* from P(s*), which is 
the prior distribution of s* given some hyperparameters (from the current MCMC iteration) and 
can therefore be sampled easily. Finally, we need to sample from P(s* | s g ), which can be derived 
from ([6]), shown as follows: 

P(4 I s g ) = P«(4) Pl*>{a g - s g )/P s g {s g ) (19) 

where P g and P g 2 ^ are the priors (either Gaussian or Cauchy) of figk and Yltg+i figki respec- 
tively. We can obtain (fT9|) from analogously as we obtained the density of s g , that is, by first 
mapping (3 and s to a set of new parameters containing s and s 4 , then integrating away other 
parameters, using the additive property of symmetric stable distributions. The distribution (fl~9l) 
splits s g into two components. 

When the priors for the f3 g kS are Gaussian distributions, the distribution ( |T9l is also a Gaussian 
distribution, given as follows: 

Y 2 / T 2 

^1 v2 I 1 ^1 



4 K ~ ^ *9 ^7^2 > ^ 1 - (20) 



J 2 
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where E 2 = Y^k=i a gk an d E 2 , = J2t 3 +i a gk- Since (120]) is a Gaussian distribution, we can sample 
from it by standard methods. 

When we use Cauchy distributions as the priors for the f3 g kS, the density function of (fl9~l) is: 



P(sl 



s g) = - 



C E 2 + (4) 2 El + (4 " s 



(21) 



where Si = Ylik=i a gki E2 = 2~2t\i a gki an d C is the normalizing constant given below by ff23l) . 



When s 9 = and £1 = E 2 , the distribution ([21]) is a t-distribution with 3 degrees of freedom, 
mean and width Ei/y3, from which we can sample by standard methods. Otherwise, the 
cumulative distribution function (CDF) of (|2ip can be shown to be: 



F(s 1 t 5 flJ Ei,E s 



3 g > 



1 



r log 



4) 2 + E? 



+ 



p ^arctan f ^- J + - J + 
p s (arctan (\^) + \ 



(22) 



where 



C 



Ps 



7T(Ei 



EiE 2 ( S 2 + (Si + E 2 ) 2 ) 



S ^ + 2(E? + EI) s 2 + (S 2 -E 



2\2 ' 



= 7^ 



*2 " (E? - E 2 , 



Ei 4 + 2 (E? + Si) S 2 + (s 2 - S|) 
1 j + (g - g) 

E 2 S 4 + 2 (E? + E 2) S 2 + (E? _ S 2)2 



(23) 
(24) 

(25) 

(26) 



When s g ^ 0, the derivation of (T22|) uses the equations below from (T2"Tj) to (T291) as follows, 
where p = (a 2 — c)/b,q = b + q,r = pc — a 2 q, and we assume 4c — b 2 > 0, 



1 



1 



If x + p 



x 2 + a 2 x 2 + bx + c r 
- 1 - 1 



x + q 



x 2 + a? x 2 + bx + c 



it + p 
w 2 + a 2 



log(x 2 



u + q 



u- 



bu 



—du = - logfx 2 + bx 
c 2 &v 



p fx 
— arctan — 
a \a 

. 2q-b 



V4c - b 2 



Ti 
2 

arctan 



2x + b 
V4c - b 2 



7T 
2 



(27) 
(28) 
(29) 



When s 9 = 0, the derivation of (1221) uses the following equations: 



S 



1 1 

x 2 + a 2 x 2 + b 2 

f 2 l 2 dU 
J-oo U + a 

Since we can compute the CDF of ( 12TT) with ( |22l) explicitly, we can use the inversion method 
to sample from §ZT$), with the inverse CDF computed by some numerical method. We chose the 
Illinois method (Thisted 1988, Page 171), which is robust and fairly fast. 

When sampling for s\, . . . ,Sq temporarily for each test case is not desired, for example, when 
we need to make predictions for a huge number of test cases at a time, we can still apply the 
above method that splits a Gaussian or Cauchy random variable into two parts n g — 1 times to 
split s g into n g parts. Our method for compressing parameters is still useful because sampling 
from the splitting distributions uses direct sampling methods, which are much more efficient than 
applying Markov chain sampling method to the original parameters. However, we will not save 
space if we take this approach of sampling for all /3's. 



b 2 -a 2 \x 2 + a 2 x 2 + b 2 J ^ 

I ( arctan (D + ^ (31) 



3 Application to Sequence Prediction Models 

In this section, we show how to compress parameters of logistic sequence prediction models in 
which states of a sequence are discrete. We will first define this class of models, and then describe 
the scheme for grouping the parameters. To demonstrate our method, we use a binary data set 
generated using a hidden Markov model, and a data set created from English text, in which 
each state has 3 possibilities (consonant, vowel, and others). These experiments show that our 
compression method produces a large reduction in the number of parameters needed for training 
the model, when the prediction for the next state of a sequence is based on a long preceding 
sequence, i.e., a high-order model. We also show that good predictions on test cases result from 
being able to use a high-order model. 



3.1 Bayesian Logistic Sequence Prediction Models 

We write a sequence of length + 1 as x\, . . . , xo, xo+i, where x t takes values from 1 to K t , for 
t = 1, . . . , O, and xo+i takes values from 1 to K. We call x±, . . . ,xo = xi :Q the historic sequence. 
For subject i we write its historic sequence and response as x^) and Xq +1 . We are interested in 
modelling the conditional distribution P(xo+i | Xv.o)- 

An interaction pattern V is written as . . . Ao], where A t can be from to K t , with 

A t = meaning that Xt can be any value from 1 to K t . For example, [0 ... 01] denotes the 
pattern that fixes xo — 1 and allows x%, . . . , xo-i to be any values in their ranges. When all 
nonzero elements of V are equal to the corresponding elements of a historic sequence, Xi-_o, we 
say that pattern V occurs in x± : o, or pattern V is expressed by X\-o, denoted by X\-o G V . We 
will use the indicator I{x\-o £ V) as a predictor variable, whose coefficient is denoted by (3-p. For 
example, /3[o ..o] is the intercept term. A logistic model assigns each possible value of the response 

(k) 

a linear function of the predictor variables. We use fi v to denote the coefficient associated with 
pattern V and used in the linear function for xq+i = k. 
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Figure 2: A picture of the coefficients, f3, for all patterns in binary sequences of length = 3. 
P1A1A2A3] is associated with the pattern written as [A 1 A 2 A 3 ], with A t — meaning that x t is 
allowed to be either 1 or 2, in other words, x t is ignored in defining this pattern. For example, /5[ooo] 
is the intercept term. These coefficients are used in defining the linear function / ((a?i, X2, £3), /3) in 
the logistic model (132]) . For each combination of (x±, £2, xs) on the left column, / ((x\, X2, Xz),f3) 
is equal to the sum of /3's along the path linked by lines, from /3{ X1X2X3 ] to /3[ooo]- 



For modeling sequences, we consider only the patterns where all zeros (if any) are at the start. 
Let us denote all such patterns by S. We write all coefficients for xo+i = k, i.e., j/?^ | V G Sj, 

collectively as fr- k \ Figure (j2J) displays (3^' for binary sequence of length = 3, for some k, 
placed in a tree-shape. 

Conditional on /3 (1) , . . . , (3 {K) and X\-o, the distribution of xo+i is defined as 

P(x 0+1 = k I X1 . , .... tfO) = -fgg^l^) (32) 



E.iexp(i(x li0 ,/3«>)) 



where 



o 



l(x 1: o,(3^) = E* fe)/ (^ G ^=<o] + EAoV,o] ( 33 ) 
veS t=1 

In Figure [2], we display the linear functions for each possible combination of (xi, X2, £3) on the 
left column, by linking together all /3's in the summation fl33l) with lines, from /3[ XlX2Xa ] to /3[ooo]- 
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The prior for each is a Gaussian or Cauchy distribution centered at 0, whose width depends 
on the order, o(V), of V, which is the number of nonzero elements of V. There are + 1 such 
width parameters, denoted by a"o, . . . , (?o- The er D 's are treated as hyperparameters, assigned 
Inverse Gamma prior distributions with some shape and rate parameters, leaving their values to 
be determined by the data. In summary, the hierarchy of the priors is: 

cr ~ Inverse-Gamma (a D , (a + 1) w ), for o = 0, . . . , O 
(3f | a o(P) ~ Cauchy(0,a o{P) ) or N(0, a 2 o(v) ), for V G S 

where Inverse-Gamma (a, A) denotes an Inverse Gamma distribution with density function 
x -a-i exp(— X/x)/T(a). We express a and A in (I34p so that the mode of the prior is w Q . 



3.2 Remarks on the Sequence Prediction Models 

The Inverse Gamma distributions have heavy upward tails when a is small, and particularly 
when a < 1, they have infinite means. An Inverse Gamma distribution with a Q < 1 and small 
w , favors small values around w Q , but still allows cr Q to be exceptionally large, as needed by 
the data. Similarly, the Cauchy distributions have heavy two-sided tails. The absolute value of 
a Cauchy random variable has infinite mean. When a Cauchy distribution with center and 
a small width is used as the prior for a group of parameters, such as all /3's of the interaction 
patterns with the same order in (1341) . a few parameters may be much larger in absolute value 
than others in this group. As the priors for the coefficients of high-order interaction patterns, 
the Cauchy distributions can therefore express more accurately than the Gaussian distributions 
the prior belief that most high-order interaction patterns are useless in predicting the response, 
but a small number may be important. 

It seems redundant to use a (3 {k) for each k = 1,...,K in (1321) since only the differences 
between (3^ matter in (1321) . A non-Bayesian model could fix one of them, say (3^\ all equal 
to 0, so as to make the parameters identifiable. However, when K ^ 2, forcing (3^ = in a 
Bayesian model will result in a prior that is not symmetric for all k, which we may not be able 
to justify. When K = 2, we do require that (3^ are all equal to 0, as there is no asymmetry 
problem. 

Inclusion of (3-p other than the highest order is also a redundancy, which facilitates the expres- 
sion of appropriate prior beliefs. The prior distributions of linear functions of similar historic 
sequences X\,o are positively correlated since they share some common /5's, for example, in the 
model displayed by Figured / ((1, 1, 1), (3) and I ((2, 1, l),/3) share /3[ n] , /%ioi] an d /%)oo]- Conse- 
quently, the predictive distributions of xo are similar given similar Xi-o- By incorporating such 
a prior belief into our inference, we borrow "statistical strength" for those historic sequences 
with few replications in the training cases from other similar sequences with more replications, 
avoiding making an unreasonably extreme conclusion due to a small number of replications. 
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3.3 Specifications of the Priors and Computation Methods 
3.3.1 The Priors for the Hyperprameters 

We fix o"o at 5 for the Cauchy models and 10 for the Gaussian models. For o > 0, the prior for 
(T is Inverse Gamma(a , (a Q + l)w ), where a Q and w Q are: 

a o = 0.25, w o = 0.1/o, foro=l,...,0 (35) 

The quantiles of Inverse-Gamma(0.25, 1.25 x 0.1), the prior of ax, are shown as follows: 
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The quantiles of other a Q can be obtained by multiplying those of ax by 1/ o. 
3.3.2 The Markov Chain Sampling Method 

We use Gibbs sampling to sample for both the s 9 's (or the f3 g kS when not applying our com- 
pression method) and the hyperparameters, a . These 1-dimensional conditional distributions 
are sampled using the slice sampling method (Neal 2003), summarized as follows. In order to 
sample from a 1-dimensional distribution with density f(x), we can draw points {x, y) from the 
uniform distribution over the set {(x, y) | < y < f(x)}, i.e., the region of the 2-dimensional 
plane between the x-axis and the curve of f(x). One can show that the marginal distribution 
of x drawn this way is f(x). We can use Gibbs sampling scheme to sample from the uniform 
distribution over {(x,y) | < y < f(x)}. Given x, we can draw y from the uniform distribution 
over {y | < y < f(x)}. Given y, we need to draw x from the uniform distribution over the 
"slice", S — {x | f(x) > y}. However, it is generally infeasible to draw a point directly from the 
uniform distribution over 5*. Neal (2003) devises several Markov chain sampling schemes that 
leave this uniform distribution over S invariant. One can show that this updating of x along 
with the previous updating of y leaves f(x) invariant. Particularly we chose the "stepping out" 
plus "shrinkage" procedures. The "stepping out" scheme first steps out from the point in the 
previous iteration, say Xq, which is in S, by expanding an initial interval, J, of size w around 
xq on both sides with intervals of size w, until the ends of / are outside S, or the number of 
steps has reached a pre-specified number, m. To guarantee correctness, the initial interval / is 
positioned randomly around Xq, and m is randomly aportioned for the times of stepping right 
and stepping left. We then keep drawing a point uniformly from / until obtaining an x in S. To 
facilitate the process of obtaining an x in S, we shrink the interval I if we obtain an x not in S 
by cutting off the left part or right part of / depending on whether x < xq or x > xq. 

We set w = 20 when sampling for /3's if we use Cauchy priors, considering that there might be 
two modes in this case, and set w = 10 if we use Gaussian priors. We set w = 1 when sampling 
for er . The value of m is 50 for all cases. We trained the Bayesian logistic sequence model, with 
the compressed or the original parameters, by running the Markov chain 2000 iterations, each 
updating the /3's 1 time, and updating the ex's 10 times, both using slice sampling. The first 
750 iterations were discarded, and every 5th iteration afterward was used to predict for the test 
cases. 
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The above specification of Markov chain sampling and the priors for the hyperparameters will 
be used for all experiments in this paper. 



3.4 Grouping Parameters of Sequence Prediction Models 

In this section, we describe a scheme for dividing the /?'s into a number of groups, based on the 
training data, such that the likelihood function depends only on the sums in groups, as shown 
by @- The likelihood function of (3^ k \ for k = 1, . . . , K, is the product of probabilities in (1321) 
applied to the training cases, x^) Q , Xq +1 , for i — 1, . . . , N (collectively denoted by V). It can be 
written as follows: 

^..■■^ix>)=n „r^:^ ^ 

Note that when K = 2, is fixed at 0, and therefore not included in the above likelihood 
function. But for simplicity, we do not write another expression for K = 2. 

Since the linear functions with different fc's have the same form except the superscript, the way 
we divide (3^ into groups is the same for all k. In the following discussion, j3r- k ' will therefore 
be written as (3, omitting k. 

As shown by (133]) . the function I (xi-.o, 0) is the sum of the /3's associated with the interaction 
patterns expressed by X\-o- If a group of interaction patterns are expressed by the same training 
cases, the associated /3's will appear simultaneously in the same factors of fl36|) . The likelihood 
function (13T)|) therefore depends only on the sum of these /3's, rather than the individual ones. 
Our task is therefore to find the groups of interaction patterns expressed by the same training 
cases. 

Let us use E-p to denote the "expression" of the pattern V — the indices of training cases 
in which V is expressed, a subset of 1,...,N. For example, -Ejo-.-o] = {!>•••> 111 other 
words, the indicator for pattern V has value 1 for the training cases in E-p, and for others. 
We can display E-p in a tree-shape, as we displayed (3p. The upper part of Figure [3] shows such 
expressions for each pattern of binary sequence of length = 3, based on 3 training cases: 
aj^g = (1, 2, l),cc^3 = (2, 1, 2) and = (1, 1, 2). From Figure [3], we can see that the expression 
of a "stem" pattern is equal to the union of the expressions of its "leaf" patterns, for example, 
#[ooo] = #[ooi] U #[002] • 

When a stem pattern has only one leaf pattern with non-empty expression, the stem and 
leaf patterns have the same expression, and can therefore be grouped together. This grouping 
procedure will continue by taking the leaf pattern as the new stem pattern, until encountering 
a stem pattern that "splits", i.e. has more than one leaf pattern with non-empty expression. 
For example, -E^ooi] , #[021] and £'[121] in Figure [3] can be grouped together. All such patterns 
must be linked by lines, and can be represented collectively with a "superpattern" SP, written 
as [0---0A b ---A o ] f = \J b t=f [0---0Af-A o }, where 1 < b < f < O + 1, and in particular 
when £ = + 1, [0 • • • 0A t ■ ■ ■ A Q ] = [0 ■ ■ ■ 0]. One can easily translate the above discussion into 
a computer algorithm. Figure H] describes the algorithm for grouping parameters of Bayesian 
logistic sequence prediction models, in a C-like language, using a recursive function. 
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Training Cases 




After Grouping 




,= {1,2,3} 



Figure 3: A picture showing that the interaction patterns in logistic sequence prediction models 
can be grouped, illustrated with binary sequences of length = 3, based on 3 training cases 
shown in the upper-right box. E-p is the expression of the pattern (or superpattern) V — the 
indices of the training cases in which the V is expressed, with meaning the empty set. We group 
the patterns with the same expression together, re-represented collectively by a "superpattern" , 
written as [0 • ■ ■ 0A b ■ ■ ■ A }j, meaning [j{ =b [0 ■ ■ • 0A t ■ ■ ■ Ao], where l<6</<0 + l, and 
in particular when t = + 1, [0 ■ • ■ 0A t ■ ■ ■ Ao] = [0 ■ ■ ■ 0]. We also remove the patterns not 
expressed by any of the training cases. Only 5 superpatterns with unique expressions are left in 
the lower picture. 



14 



INPUTS: 

N: number of training cases 

O: length of sequences 
X: training data, N X O matrix 
OUTPUTS: 

LSP: list of superpatterns 
LE: list of expressions 

INPUTS of 'DIVERGE' (shown on the right): 

E: expression, subset of 1, ... ,N 

SP: superpattern, structure with members 
P: pattern, vector of length O 

f : index of fixed state in P 

ALGORITHM: 
E= {1, ... ,N} 

SP.f = 0+ 1, SP.P=(0, ... ,0) 
LSP = NULL 
LE = NULL 
DIVERGE(E,SP) 
RETURN LE and LSP 



DIVERGE(E, SP) 



b = SP.f- 1 

while ( # of unique values in X[E][b] is 1 & b >0) 
{ SP.P [b] = the unique value in X[E] [b] 
b = b- 1 

1 

Add SP to LSP 
Add E to LE 
if ( b > ) 

{ for( x in unique values in X[E][b] ) 
{ SubE={iinE:X[i][b]=x} 
Set NSP = SP 
NSP.f = b , NSP.P[b] = x 
DIVERGE(SubE, NSP) 



Figure 4: The algorithm for grouping parameters of Bayesian logistic sequence prediction models. 
To group parameters, we call function "DIVERGE" with the initial values of expression E = 
{1, . . . , N} and superpattern SP = [0 . . . 0]o+i, as shown in above picture, resulting in two lists 
of the same length, LE and LSP, respectively storing the expressions and the corresponding 
superpatterns. Note that the first index of an array is assumed to be 1, and that the Xfi?]^] 
means a 1-dimension subarray of X in which the row indices are in E and the column index 
equals b. 
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An important property of our method for compressing parameters of sequence prediction 
models is that given N sequences as training data, conceivably of infinite length, denoted by 
X-oo, . . . , for % = 1, . . . , TV, the number of superpatterns with unique expressions, and ac- 
cordingly the number of compressed parameters, will converge to a finite number as O increases. 
The justification of this claim is that if we keep splitting the expressions following the tree shown 
in Figure [3j at a certain time, say t, every expression will be an expression with only 1 element 
(suppose we in advance remove the sequences that are identical with another one). When consid- 
ering further smaller t, no more new superpattern with different expressions will be introduced, 
and the number of superpatterns will not grow. The number of the compressed parameters, the 
regression coefficients for the superpatterns, will therefore not grow after the time t. 

In contrast, after the time t when each interaction pattern is expressed by only 1 training case, 
if the order is increased by 1, the number of interaction patterns is increased by the number 
of training cases. The regression coefficients associated with these original interaction patterns, 
called the original parameters thereafter, will grow linearly with the order considered. Note that 
these original parameters do not include the regression coefficients for those interaction patterns 
not expressed by any training case. The total number of regression coefficients defined by the 
model grows exponentially with the order considered. 



3.5 Making Prediction for a Test Case 

Given (3^\ . . . , (3r \ the predictive probability for the next state x* 0+l of a test case for which we 
know the historic sequence x\. can be computed using equation ( 1321) . applied to x\. . A Monte 
Carlo estimate of P{x* 0+1 = k | x\. ,T>) can be obtained by averaging (1321) over the Markov 
chain samples from the posterior distribution of f3^\ . . . , (r- K \ 

Each of the + 1 patterns expressed by the test is either expressed by some training 

case (and therefore belongs to one of the superpatterns), or is a new pattern (not expressed 
by any training case). Suppose we have found 7 superpatterns. The + 1 /3's in the linear 
function l(x\. , (3^) can accordingly be divided into 7 + 1 groups (some groups may be empty). 
The function l{x\. , (3^) can be written as the sum of the sums of the /3's over these 7 + 1 
groups. Consequently, P(x* 0+l = k \ x\. ) can be written in the form of (TT7|) . As discussed 
in Section \2A\ we need to only split the sum of the /3's associated with a superpattern, i.e., a 
compressed parameter s g , into two parts, such that one of them is the sum of those (3 expressed 
by the test using the splitting distribution (I19p . 

It is easy to identify the patterns that are also expressed by x\. Q from a superpattern 
[0 • ■ ■ Af, ■ ■ ■ Ao]f. If (Xf, . . . , x* ) 7^ (Af, . . . , Ao), none of the patterns in [0 • ■ ■ Aj, ■ ■ ■ Ao]f are 
expressed by x\. 0) otherwise, if . . . , x* Q ) = {Ay, . . . , Ao) for some b' (b < b' < f), all patterns 
in [0 • • • Ay ■ ■ ■ Aq] f are expressed by x* vo . 



3.6 Experiments with a Hidden Markov Model 

In this section we apply Bayesian logistic sequence prediction modeling, with or without our 
compression method, to data sets generated using a Hidden Markov model, to demonstrate our 
method for compressing parameters. The experiments show that when the considered length 
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of the sequence is increased, the number of compressed parameters will converge to a fixed 
number, whereas the number of original parameters will increase linearly. Our compression 
method also improves the quality of Markov chain sampling in terms of autocorrelation. We 
therefore obtain good predictive performances in a small amount of time using long historic 
sequences. 

3.6.1 The Hidden Markov Model Used to Generate the Data 

Hidden Markov models (HMM) are applied widely in many areas, for example, speech recognition 
(Baker 1975), image analysis (Romberg et.al. 2001), computational biology (Sun 2006). In 
a simple hidden Markov model, the observable sequence {x t \ t = 1, 2, . . .} is modeled as a 
noisy representation of a hidden sequence {ht | t — 1, 2, . . .} that has the Markov property (the 
distribution of h t given h t -i is independent with the previous states before h t -\). Figure [5] 
displays the hidden Markov model used to generate our data sets, showing the transitions of 
three successive states. The hidden sequence h t is an Markov chain with state space {1, . . . , 8}, 
whose dominating transition probabilities are shown by the arrows in Figure each of which 
is 0.95. However, the hidden Markov chain can move from any state to any other state as well, 
with some small probabilities. If h t is an even number, xt will be equal to 1 with probability 
0.95 and 2 with probability 0.05, otherwise, Xt will be equal to 2 with probability 0.95 and 1 
with probability 0.05. The sequence {xt \ t = 1,2,...} generated by this exhibits high-order 
dependency, though the hidden sequence is only a Markov chain. We can see this by looking at 
the transitions of observable Xt in Figure [5j For example, if x± = 1 (rectangle) and X2 = 2 (oval), 
it is most likely to be generated by hi = 2 and Y12 = 3, since this is the only strong connection 
from the rectangle to the oval, consequently, h 3 = 8 is most likely to to be the next, and £3 is 
therefore most likely to be 1 (rectangle). 

3.6.2 Experiment Results 

We used the HMM in Figure |5] to generate 5500 sequences with length 21. We used 5000 
sequences as test cases, and the remaining 500 as the training cases. We tested the prediction 
methods by predicting X21 based on varying numbers of preceding states, O, chosen from the set 
{1,2,3,4,5,7, 12,15,17,20}. 

Figure [6] compares the number of parameters and the times used to train the model, with 
and without our compression method. It is clear that our method for compressing parameters 
reduces greatly the number of parameters. The ratio of the number of compressed parameters to 
the number of the original ones decreases with the number of preceding states, O. For example, 
the ratio reaches 0.207 when O = 20. This ratio will reduce to when considering even bigger 
O, since the number of original parameters will grow with O while the number of compressed 
parameters will converge to a finite number, as discussed in Section 13.41 There are similar 
reductions for the training times with our compression method. But the training time with 
compressed parameters will not converge to a finite amount, since the time used to update the 
hyperparameters (a 's) grows with order, O. Figure [6] also shows the prediction times for 5000 
training cases. The small prediction times show that the methods for splitting Gaussian and 
Cauchy variables are very fast. The prediction times grow with O because the time used to 
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hj h 2 h 3 




Figure 5: A picture showing a Hidden Markov Model, which is used to generate sequences 
to demonstrate Bayesian logistic sequence prediction models. Only the dominating transition 
probabilities of 0.95 are shown using arrows in the above graph, while from any state the hidden 
Markov chain can also move to any other state with a small probability. When ht is in a rectangle, 
x t is equal to 1 with probability 0.95, and 2 with probability 0.05, otherwise, when h t is in an 
oval, x t is equal to 2 with probability 0.95, and 1 with probability 0.05. 

identify the patterns in a superpattern expressed by a test case grows with O. The prediction 
times with the original parameters are not shown in Figure El since we do not claim that our 
compression method saves prediction time. (If we used the time-optimal programming method 
for each method, the prediction times with compressed parameters should be more than without 
compressing parameters since the method with compression should include times for identifying 
the patterns from the superpattern for test cases. With our software, however, prediction times 
with compression are less than without compression, which is not shown in Figure El because the 
method without compression needs to repeatedly read a huge number of the original parameters 
into memory from disk.) 

Compressing parameters also improves the quality of Markov chain sampling. Figure [7] shows 
the autocorrelation plots of the hyperparameters a , for o = 10, 12, 15, 17, 20, when the length of 
the preceding sequence, O, is 20. It is clear that the autocorrelation decreases more rapidly with 
lag when we compress the parameters. This results from the compressed parameters capturing 
the important directions of the likelihood function (i.e. the directions where a small change can 
result in large a change of the likelihood). We did not take the time reduction from compressing 
parameters into consideration in this comparison. If we rescaled the lags in the autocorrelation 
plots according to the computation time, the reduction of autocorrelation of Markov chains with 
the compressed parameters would be much more pronounced. 
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Figure 6: Plots showing the reductions of the number of parameters and the training time with 
our compression method using the experiments on a data set generated by a HMM. The upper- 
left plot shows the number of the compressed and the original parameters based on 500 training 
sequences for O = 1, 2, 3, 4, 5, 7, 10, 12, 15, 17, 20, their ratios are shown in the upper-right plot. 
In the above plots, the lines with o are for the methods with parameters compressed, the lines 
with x are for the methods without parameters compressed, the dashed lines are for the methods 
with Gaussian priors, and the dotted lines are for the methods with Cauchy priors. The lower-left 
plot shows the training times for the methods with and without parameters compressed. The 
lower-right plot shows the prediction time only for the methods with parameters compressed. 
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Figure 7: The autocorrelation plots of cr 's for the experiments on a data set generated by a 
HMM, when the length of the preceding sequence O = 20. We show the autocorrelations of 
a Q , for o = 10, 12, 15, 17,20. In the above plots, "Gaussian" in the titles indicates the methods 
with Gaussian priors, "Cauchy" indicates with Cauchy priors, "comp" indicates with parameters 
compressed, "no comp" indicates without parameters compressed. 
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Error Rates with Cauchy Priors 
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Figure 8: Plots showing the predictive performance using the experiments on a data set generated 
by a HMM. The left plots show the error rates and the right plots show the average minus log 
probabilities of the true responses in the test cases. The upper plots show the results when using 
the Cauchy priors and the lower plots shows the results when using the Gaussian priors. In all 
plots, the lines with o are for the methods with parameters compressed, the lines with x are for 
the methods without parameters compressed. The numbers of the training and test cases are 
respectively 500 and 5000. The number of classes of the response is 2. 
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Finally, we evaluated the predictive performance in terms of error rate (the fraction of wrong 
predictions in test cases), and the average minus log probability (AMLP) of observing the true 
response in a test case based on the predictive probability for different classes. The performance 
of with and without compressing parameters are the same, as should be the case in theory, and 
will be in practice when the Markov chains for the two methods converge to the same modes. 
Performance of methods with Cauchy and Gaussian priors is also similar for this example. The 
predictive performance is improved when O goes from 1 to 5. When O > 5 the predictions are 
slightly worse than with O = 5 in terms of AMLP. The error rates for O > 5 are almost the same 
as for = 5. This shows that the Bayesian models can perform reasonably well even when we 
consider a very high order, as they avoid the overfitting problem in using complex models. We 
therefore do not need to restrict the order of the Bayesian sequence prediction models to a very 
small number, especially after applying our method for compressing parameters. 

3.7 Experiments with English Text 

We also tested our method using a data set created from an online article from the website of 
the Department of Statitics, University of Toronto. In creating the data set, we encoded each 
character as 1 for vowel letters (a,e,i,o,u), 2 for consonant letters, and 3 for all other characters, 
such as space, numbers, special symbols, and we then collapsed multiple occurrences of "3" into 
only 1 occurrence. The length of the whole sequence is 3930. Using it we created a data set with 
3910 overlaped sequences of length 21, and used the first 1000 as training data. 

The experiments were similar to those in Section 13.61 with the same priors and the same 
computational specifications for Markov chain sampling. Figures [TUJ [TTJ and [12] show the 
results. All the conclusions drawn from the experiments in Section 13.61 are confirmed in this 
example, with some differences in details. In summary, our compression method reduces greatly 
the number of parameters, and therefore shortens the training process greatly. The quality of 
Markov chain sampling is improved by compressing parameters. Prediction is very fast using our 
splitting methods. The predictions on the test cases are improved by considering higher order 
interactions. From Figure [TTJ at least some order 10 interactions are useful in predicting the 
next character. 

In this example we also see that when Cauchy priors are used Markov chain sampling with the 
original parameters may have been trapped in a local mode, resulting in slightly worse predictions 
on test cases than with the compressed parameters, even though the models used are identical. 

We also see that the models with Cauchy priors result in better predictions than those with 
Gaussian priors for this data set, as seen from the plots of error rates and AMLPs. To investigate 
the difference of using Gaussian and Cauchy priors, we first plotted the medians of Markov 
chains samples (in the last 1250 iteractions) of all compressed parameters, s, for the model 
with O = 10, shown in Figure [T2J where the right plot shows in a larger scale the rectangle 
(—2, 2) x (—2, 2). This figure shows that a few (3 with large medians in the Cauchy model have 
very small corresponding medians in the Gaussian model. 

We also looked at the traces of some compressed parameters, as shown in Figure [131 The three 
compressed parameters shown all contain only a single (3. The plots on the top are for the f3 
for "CC:V", used for predicting whether the next character is a vowel given the preceding two 
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Figure 9: Plots showing the reductions of the number of parameters and the training and predic- 
tion time with our compression method using the experiments on English text. The upper- left 
plot shows the number of the compressed and the original parameters based on 500 training se- 
quences for = 1,2, 3, 4, 5, 7, 10, 12, 15, 17, 20, their ratios are shown in the upper-right plot. In 
the above plots, the lines with o are for the methods with parameters compressed, the lines with 
x are for the methods without parameters compressed, the dashed lines are for the methods with 
Gaussian priors, and the dotted lines are for the methods with Cauchy priors. The lower-left 
plot shows the training times for the methods with and without parameters compressed. The 
lower-right plot shows the prediction time only for the methods with parameters compressed. 
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Figure 10: The autocorrelation plots of the cr 's for the experiments on English text data, when 
the length of the preceding sequence O = 20. We show the autocorrelation plot of a a , for 
o = 10, 12, 15, 17, 20. In the above plots, "Gaussian" in the titles indicates the methods with 
Gaussian priors, "Cauchy" indicates with Cauchy priors, "comp" indicates with parameters 
compressed, "no comp" indicates without parameters compressed. 
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Figure 11: Plots showing the predictive performance using the experiments on English text data. 
The left plots show the error rate and the right plots show the average minus log probability of 
the true response in a test case. The upper plots show the results when using the Cauchy priors 
and the lower plots shows the results when using the Gaussian priors. In all plots, the lines with 
o are for the methods with parameters compressed, the lines with x are for the methods without 
parameters compressed. The numbers of the training and test cases are respectively 1000 and 
2910. The number of classes of the response is 3. 
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Figure 12: Scatterplots of medians of all compressed parameters, s, of Markov chain samples in 
the last 1250 iterations, for the models with Cauchy and Gaussian priors, fitted with English 
text data, with the length of preceding sequence O = 10, and with the parameters compressed. 
The right plot shows in a larger scale the rectangle (—2,2) x (—2,2). 

characters are consonants; the plots in the middle are for "__CC:V" , where denotes a space 
or special symbol; the plots on the bottom are for "CCVCVCC:V" , which had the largest median 
among all compressed parameters in the Cauchy model, as shown by Figure [T2j The regression 
coefficient (3 for "CC:V" should be close to by our common sense, since two consonants can be 
followed by any of three types of characters. We can very commonly see "CCV" , such as "the" , 
and "CC__", such as "with__", and not uncommonly see "CCC", such as "technique" , "world" , 
etc. The Markov chain trace of this j3 with a Cauchy prior moves in a smaller region around 
than with a Gaussian prior. But if we look back one more character, things are different. The 
regression coefficient (3 for "__CC:V" is fairly large, which is not surprising. The two consonants 
in "__CC:V" stand for two letters in the beginning of a word. We rarely see a word starting with 
three consonants or a word consisting of only two consonants. The posterior distribution of this 
9 for both Cauchy and Gaussian models favor positive values, but the Markov chain trace for 
the Cauchy model can move to much larger values than for the Gaussian model. As for the high- 
order pattern "CCVCVCC" , it matches words like "statistics" or "statistical" , which repeatedly 
appear in an article introducing a statisics department. Again, the Markov chain trace of this (3 
for the Cauchy model can move to much larger values than for Gaussian model, but sometimes 
it is close to 0, indicating that there might be two modes for its posterior distributution. 

The above investigation reveals that a Cauchy model allows some useful (3 to be much larger 
in absolute value than others while keeping the useless (3 in a smaller region around than a 
Gaussian model. In other words, Cauchy models are more powerful in finding the information 
from the many possible high-order interactions than Gaussian models, due to the heavy two-sided 
tails of Cauchy distributions. 
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Figure 13: Plots of Markov chain traces of three compressed parameters (Each contains only one 
(3) from Experiments on English text with 10 preceding states, with Cauchy or Gaussian priors. 
The parameters are annotated by their original meanings in English sequence. For example, 
'__CC:V stands for the parameter for predicting that the next character is a "vowel" given 
preceding three characters are "space, consonant, consonant". 



27 



4 Conclusion and Discussion 



In this paper, we have proposed a method to effectively reduce the number of parameters of 
Bayesian classification and regression models with high-order interactions, using a compressed 
parameter to represent the sum of all the regression coefficients for the predictor variables that 
have the same values for all the training cases. Working with these compressed parameters, we 
greatly shorten the training time with MCMC. These compressed parameters can later be split 
into the original parameters efficiently We have demonstrated, theoretically and empirically, 
that given a data set with fixed number of cases, the number of compressed parameters will 
have converged before considering the highest possible order. Applying Bayesian methods to 
regression and classification models with high-order interactions therefore become much easier 
after compressing the parameters, as shown by our experiments with simulated and real data. 
The predictive performance will be improved by considering high-order interactions if some useful 
high-order interactions do exist in the data. 

We have devised an efficient scheme for compressing parameters of Bayesian logistic sequence 
prediction models, as described in Section [31 The algorithm for sequence prediction models is 
efficient. The resulting groups of interaction patterns have unique expressions. We have also 
found similar schemes for compressing parameters of general Bayesian classification models with 
discrete features, though it is more difficult, see (Li 2007). 

We have also empirically demonstrated that Cauchy distributions with location parameter 0, 
which have heavy two-sided tails, are more appropriate than Gaussian distributions in capturing 
the prior belief that most of the parameters in a large group are very close to but a few of 
them may be much larger in absolute value, as we may often think appropriate for the regression 
coefficients in certain problems. 

We have implemented the compression method only for classification models in which the 
response and the features are both discrete. Without any difficulty, the compression method can 
be used in regression models in which the response is continuous but the features are discrete, 
for which we need only use another distribution to model the continuous response variable, for 
example, a Gaussian distribution. Unless one converts the continuous features into discrete 
values, it is not clear how to apply the method described in this paper to continuous features. 
However it seems possible to apply the more general idea that we need to work only with those 
parameters that matter in the likelihood function when training models with MCMC, probably 
by transforming the original parameters. 
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